1 THOUGHTS and TODO:

2 testing slide

class one

class two

class three

class four

class five

class six

class seven

class eight

class nine

class ten

class eleven

class twelve

3 Why?

3.1 Embarassingly parallel tasks

These are computational tasks which involve many separate, independently executable calculations. Some common statistical examples of embarassingly parallel processes:

  • bootstrapping
  • cross-validation
  • simulating independent random variables (dorng)

In contrast, some sequential or non-parallel processes:

  • MCMC algorithms
  • stepwise model selection (e.g.: step())

For loops that do not explicitly involve dependent calculations are wasteful, especially if we have multiple processors available. Perhaps even worse, the time cost of using a wasteful approach can put some useful statistical tools beyond our reach!

3.2 Options in R

  • changing from a for loop to one of the apply() functions can help, but still doesn’t use multiple processors
  • parallel package (thanks, Miranda!)
  • don’t use R
  • foreach packages!

3.3 Why foreach?

We would like to find a way to make use of our whole computer, and make useful tasks like bootstrapping available to use, but without having to invest large amounts of time in learning new programming languages. Enter foreach, which keeps the structure of a for loop, but allows us to drop two key assumptions:

  • sequentiality
  • single processor architecture

Our goal: transform a traditional for loop into a foreach loop We will begin with a simple chunk of R code involving a for loop, and transform it. Along the way, we’ll take a look at the equivalent computation done with an apply() function, and see that using foreach and multiple processors outperforms this too.

4 The data and research question

We are going to look at data from the New York City bikeshare program (Citibike).

We want to find a model which can offer good prediction. Start with a few plausible models and use K fold cross validation to decide which one to use.

4.1 some nice figures

5 Fitting GLMs and extracting prediction error

We now suppose that we’re considering three increasingly complex models of the arrival behavior. In order to compare three candidate models prediction error, we’ll use K-fold cross validation, where we use the same folds for all three models.

Here’s some familiar code that sets things up:

source("~/git/fixschewitt/foreach/citibike_new_york/EDA/subsetting_arrivals_attributes.R")
lq.loss <- function(y, y.hat, q = 1) {(abs(y - y.hat))^q}
get.errs <- function(test.set = NULL,
                     discarded = NULL,
                     q = q) {
    sml.glm <- glm(arrivals ~
                   bs(hour, degree = 4)
                   + weekend
                   + as.factor(id),
                   data = arrivals.sub[-c(discarded, test.set), ],
                   family = "poisson")
    med.glm <- glm(arrivals ~
                   bs(hour, degree = 4)*weekend
                   + as.factor(id),
                   data = arrivals.sub[-c(discarded, test.set), ],
                   family = "poisson")
    big.glm <- glm(arrivals ~
                   bs(hour, degree = 4)*weekend
                   + bs(hour, degree = 4)*as.factor(id),
                   data = arrivals.sub[-c(discarded, test.set), ],
                   family = "poisson")
    sml.err <- mean(lq.loss(predict(object = sml.glm,
                                    newdata = arrivals.sub[test.set, -7],
                                    type = "response"),
                            arrivals.sub[test.set, 7],
                            q = q))
    med.err <- mean(lq.loss(predict(object = med.glm,
                                    newdata = arrivals.sub[test.set, -7],
                                    type = "response"),
                            arrivals.sub[test.set, 7],
                            q = q))
    big.err <- mean(lq.loss(predict(object = big.glm,
                                    newdata = arrivals.sub[test.set, -7],
                                    type = "response"),
                            arrivals.sub[test.set, 7],
                            q = q))
    return(c(sml.err, med.err, big.err))
}

Next, we make our K-fold test sets (and implicitly, our training sets):

K <- 50
N <- dim(arrivals.sub)[1]

## kill off 8 observations and make cv test sets
set.seed(1985)
discarded <- sample(1:N, size = 8)
cv.test.sets <- matrix(sample((1:N)[-discarded], size = N - 8), ncol = K)

5.1 K-fold CV with a for loop

Using a naive for loop, we could implement this as:

err.for <- NULL
system.time(
    for (i in 1:K) {
        err.for <- cbind(err.for, get.errs(test.set = cv.test.sets[, i],
                                           discarded = discarded,
                                           q = 1))
        }
    )
##    user  system elapsed 
##  17.051   0.925  18.118

5.2 K-fold CV with an apply function

If you’re good with apply() functions you might upgrade to

## apply version
system.time(
    err.apply <- sapply(X = 1:K, 
                        FUN = function(i) {
                            get.errs(test.set = cv.test.sets[, i],
                                     discarded = discarded,
                                     q = 1)
                            }
                        )
    )
##    user  system elapsed 
##  16.126   0.956  17.129

Neither of the first two methods take advantage of multiple processors. While apply() functions avoid the inherently sluggish nature of for loops in R, they are still ignorant of the processor structure. We want to chop the job into halves, fourths, etc. and use the whole computer!

5.3 K-fold CV with a foreach loop

Here is the same computation written with a foreach() loop

## foreach version
library(foreach)
library(doParallel)
registerDoParallel(cl = 4)
system.time(
    err.foreach <- foreach(i=1:K,
                           .inorder = FALSE,
                           .combine = "cbind") %dopar% {
                               get.errs(test.set = cv.test.sets[, i],
                                        discarded = discarded,
                                        q = 1)
                               }
    )
##    user  system elapsed 
##  22.395   1.585   9.731

6 Components of a foreach loop

7 Results

plot of chunk errs

8 Additional topics

8.1 iterators

Sometimes the list or vector that we are iterating over (in the above case, the vector 1:N) can be a very large object. In our case, the vector is quite reasonable in size, but perhaps the object we were iterating over was a multi-dimensional object, with many values, and a high level of precision. In this case, we’d be storing a massive object, which could potentiall fill up our useable memory and slow things down. We could save memory by only keeping the piece of our list we need for a given calculation, and dumping the rest. This is the idea behind the iterators package in R.

Here is our same example with an iterator instead of a list.

library(iterators)
registerDoParallel(cl = 4)
system.time(
    err.foreach.iter <- foreach(x = iter(cv.test.sets, by = "col"),
                               .inorder = FALSE,
                               .combine = "cbind") %dopar% {
                                   get.errs(test.set = x,
                                            discarded = discarded,
                                            q = 1)
                                   }
    )
##    user  system elapsed 
##  25.115   1.119  10.176

8.2 Random numbers

When parallelizing a process which generates random numbers we need to be careful, since we aren’t guaranteed that foreach will handle this properly. We could wind up getting numbers that aren’t in fact random! Moreover, if we want to be able to reproduce an analysis which uses a random number generator, the usual set.seed() isn’t guaranteed to work.

Fortunately, there is doRNG. There are many ways to implement this package, the two easiest of which are:

library(doRNG)
registerDoParallel(cl = 4)
blah1 <- foreach(x = 1:10, 
                 .options.RNG = 1985,
                 .combine = 'c') %dorng% {
                     rnorm(1)
                     }

and

registerDoParallel(cl = 4)
registerDoRNG(seed = 1985)
blah2 <- foreach(x = 1:10,
                 .combine = 'c') %dopar% {
    rnorm(1)
    }

Note that this gives reproducible results!

sum(blah1 != blah2) 
## [1] 0

8.3 packages that support foreach

Some packages come with parallel functionality built in via foreach.

  • glmnet
  • gam
  • ggmcmc
  • plyr
  • many others

9 Fix up the following code